Transcribing lectures using Whisper.md
Recording quality is critical. Poor recordings with dropouts can lead to confused transcript timings that require significant post-editing. Whisper can often end up creating very short text chunks, leading to a kind of “rapid fire” effect.

Assuming a good quality recording, the following settings seem to do a good job:

* Medium model. This performs better than the large model when transcribing English because the large model isn’t English-specific.
* Enable word timestamps.
* Explicitly set the maximum number of words per line. 20 seems slightly too short, but realistically any value is probably going to cause some awkward “text breaks”.
* VTT output. EchoVideo supports other formats, but VTT is supported by everything and is easy to work with. (Needs a better VS Code extension, though.)
* Disable FP16 if running on Apple M1 or M2, as they don’t have support for it. It should be fine on everything else.

For example:

```sh
whisper --model medium --language English --output_format vtt --fp16 False --word_timestamps True --max_words_per_line 30 <input-file>
```

VS Code extension issues:

* Missing feature: merge subtitles. This would merge the selected subtitles into one and adjust the timestamps accordingly.
* Bug: adjusting timing sometimes produces timestamps like `00:40:48.1000`, which should actually be `00:40:49.000`. Clearly there is something slightly wonky with the arithmetic: the milliseconds overflow is not being carried into the seconds.
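Both extension issues come down to timestamp arithmetic, so here is a minimal Python sketch of how they might be handled outside the editor. It is not the extension's code; the function names (`parse_ts`, `format_ts`, `shift_ts`, `merge_cues`) and the `(start, end, text)` tuple representation of a cue are assumptions made for illustration. The idea is to convert each timestamp to whole milliseconds, do the arithmetic there, and only then format it back, so that an overflow such as `.1000` cannot occur.

```python
import re

# hh:mm:ss.ttt, with the hours part optional (as WebVTT allows).
_TS = re.compile(r"^(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})$")


def parse_ts(ts: str) -> int:
    """Convert a WebVTT timestamp string to whole milliseconds."""
    m = _TS.match(ts)
    if m is None:
        raise ValueError(f"not a valid WebVTT timestamp: {ts!r}")
    hours, minutes, seconds, millis = (int(g or 0) for g in m.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def format_ts(total_ms: int) -> str:
    """Format whole milliseconds as hh:mm:ss.ttt, carrying overflow upwards."""
    if total_ms < 0:
        raise ValueError("timestamps cannot be negative")
    seconds, millis = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def shift_ts(ts: str, delta_ms: int) -> str:
    """Move a timestamp by delta_ms (which may be negative)."""
    return format_ts(parse_ts(ts) + delta_ms)


def merge_cues(cues):
    """Merge consecutive (start, end, text) cues into a single cue.

    The merged cue runs from the earliest start to the latest end,
    and its text is the individual texts joined with spaces.
    """
    if not cues:
        raise ValueError("need at least one cue to merge")
    start = min(parse_ts(c[0]) for c in cues)
    end = max(parse_ts(c[1]) for c in cues)
    text = " ".join(c[2].strip() for c in cues)
    return (format_ts(start), format_ts(end), text)
```

For instance, `shift_ts("00:40:48.900", 100)` gives `00:40:49.000` rather than `00:40:48.1000`, and `merge_cues` applied to two adjacent cues returns one cue spanning both.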